<!DOCTYPE html>
<html lang="en-US">
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width,initial-scale=1">
    <title>python爬取今日头条图集 | Finen&#39;s Blog</title>
    <meta name="generator" content="VuePress 1.4.1">
    <link rel="icon" href="/logo.png">
    <link rel="manifest" href="/manifest.json">
    <link rel="apple-touch-icon" href="/icons/apple-touch-icon-152x152.png">
    <link rel="mask-icon" href="/icons/safari-pinned-tab.svg" color="#3eaf7c">
    <meta name="description" content="Stay Hungry! Stay Foolish!">
    <meta name="theme-color" content="#3eaf7c">
    <meta name="apple-mobile-web-app-capable" content="yes">
    <meta name="apple-mobile-web-app-status-bar-style" content="black">
    <meta name="msapplication-TileImage" content="/icons/msapplication-icon-144x144.png">
    <meta name="msapplication-TileColor" content="#000000">
    <link rel="preload" href="/assets/css/0.styles.331bc503.css" as="style"><link rel="preload" href="/assets/js/app.8657dc36.js" as="script"><link rel="preload" href="/assets/js/3.655fda8f.js" as="script"><link rel="preload" href="/assets/js/74.d36a7a16.js" as="script"><link rel="preload" href="/assets/js/11.32858d9c.js" as="script"><link rel="prefetch" href="/assets/js/1.af5e15a9.js"><link rel="prefetch" href="/assets/js/10.3d57a2e0.js"><link rel="prefetch" href="/assets/js/12.cef9d203.js"><link rel="prefetch" href="/assets/js/13.02beb1f0.js"><link rel="prefetch" href="/assets/js/14.7efc03c9.js"><link rel="prefetch" href="/assets/js/15.6bbbab3f.js"><link rel="prefetch" href="/assets/js/16.f90edb71.js"><link rel="prefetch" href="/assets/js/17.0d4d6c13.js"><link rel="prefetch" href="/assets/js/18.76dd8979.js"><link rel="prefetch" href="/assets/js/19.7da82e6f.js"><link rel="prefetch" href="/assets/js/20.55ee1642.js"><link rel="prefetch" href="/assets/js/21.088a02d1.js"><link rel="prefetch" href="/assets/js/22.73d5a697.js"><link rel="prefetch" href="/assets/js/23.829e24c3.js"><link rel="prefetch" href="/assets/js/24.2f258433.js"><link rel="prefetch" href="/assets/js/25.0a52df22.js"><link rel="prefetch" href="/assets/js/26.1c66cc2f.js"><link rel="prefetch" href="/assets/js/27.aa781144.js"><link rel="prefetch" href="/assets/js/28.b2cf7947.js"><link rel="prefetch" href="/assets/js/29.b61b1598.js"><link rel="prefetch" href="/assets/js/30.ba0ae8b1.js"><link rel="prefetch" href="/assets/js/31.7637d043.js"><link rel="prefetch" href="/assets/js/32.e552a016.js"><link rel="prefetch" href="/assets/js/33.334376b0.js"><link rel="prefetch" href="/assets/js/34.08c95f99.js"><link rel="prefetch" href="/assets/js/35.76b6b95d.js"><link rel="prefetch" href="/assets/js/36.539b6d80.js"><link rel="prefetch" href="/assets/js/37.850783a5.js"><link rel="prefetch" href="/assets/js/38.06c33708.js"><link rel="prefetch" href="/assets/js/39.85fc61d5.js"><link rel="prefetch" href="/assets/js/4.12c21fd1.js"><link rel="prefetch" href="/assets/js/40.cdc93796.js"><link rel="prefetch" href="/assets/js/41.be79357e.js"><link rel="prefetch" href="/assets/js/42.e1ea76e5.js"><link rel="prefetch" href="/assets/js/43.06bf748b.js"><link rel="prefetch" href="/assets/js/44.ebd514b5.js"><link rel="prefetch" href="/assets/js/45.12908515.js"><link rel="prefetch" href="/assets/js/46.78fb0130.js"><link rel="prefetch" href="/assets/js/47.1a3b6cfd.js"><link rel="prefetch" href="/assets/js/48.ef04cdb5.js"><link rel="prefetch" href="/assets/js/49.763fd944.js"><link rel="prefetch" href="/assets/js/5.e50a6b6f.js"><link rel="prefetch" href="/assets/js/50.c8a7a24d.js"><link rel="prefetch" href="/assets/js/51.aaa9e6c7.js"><link rel="prefetch" href="/assets/js/52.7d3a85df.js"><link rel="prefetch" href="/assets/js/53.00784a7d.js"><link rel="prefetch" href="/assets/js/54.bd3845dd.js"><link rel="prefetch" href="/assets/js/55.10a67e4b.js"><link rel="prefetch" href="/assets/js/56.eb909b5a.js"><link rel="prefetch" href="/assets/js/57.4b821666.js"><link rel="prefetch" href="/assets/js/58.1db8af1f.js"><link rel="prefetch" href="/assets/js/59.d3045373.js"><link rel="prefetch" href="/assets/js/6.f25d4964.js"><link rel="prefetch" href="/assets/js/60.e1ccc619.js"><link rel="prefetch" href="/assets/js/61.eca4de24.js"><link rel="prefetch" href="/assets/js/62.be7238a5.js"><link rel="prefetch" href="/assets/js/63.a6ace044.js"><link rel="prefetch" href="/assets/js/64.bbd88e61.js"><link rel="prefetch" href="/assets/js/65.1f6c1848.js"><link rel="prefetch" href="/assets/js/66.d22e96e8.js"><link rel="prefetch" href="/assets/js/67.80ad6427.js"><link rel="prefetch" href="/assets/js/68.130aa1f6.js"><link rel="prefetch" href="/assets/js/69.4b778299.js"><link rel="prefetch" href="/assets/js/7.9c54e368.js"><link rel="prefetch" href="/assets/js/70.edcac904.js"><link rel="prefetch" href="/assets/js/71.90aa842c.js"><link rel="prefetch" href="/assets/js/72.4cb8aeae.js"><link rel="prefetch" href="/assets/js/73.6f272f7b.js"><link rel="prefetch" href="/assets/js/75.362400fa.js"><link rel="prefetch" href="/assets/js/76.140673a6.js"><link rel="prefetch" href="/assets/js/77.5f1336bf.js"><link rel="prefetch" href="/assets/js/78.6105e050.js"><link rel="prefetch" href="/assets/js/79.dd680c04.js"><link rel="prefetch" href="/assets/js/8.786d8198.js"><link rel="prefetch" href="/assets/js/80.ebfb944f.js"><link rel="prefetch" href="/assets/js/81.892d344c.js"><link rel="prefetch" href="/assets/js/9.b0b77105.js">
    <link rel="stylesheet" href="/assets/css/0.styles.331bc503.css">
  </head>
  <body>
    <div id="app" data-server-rendered="true"><div class="theme-container no-sidebar"><header class="navbar"><div class="sidebar-button"><svg xmlns="http://www.w3.org/2000/svg" aria-hidden="true" role="img" viewBox="0 0 448 512" class="icon"><path fill="currentColor" d="M436 124H12c-6.627 0-12-5.373-12-12V80c0-6.627 5.373-12 12-12h424c6.627 0 12 5.373 12 12v32c0 6.627-5.373 12-12 12zm0 160H12c-6.627 0-12-5.373-12-12v-32c0-6.627 5.373-12 12-12h424c6.627 0 12 5.373 12 12v32c0 6.627-5.373 12-12 12zm0 160H12c-6.627 0-12-5.373-12-12v-32c0-6.627 5.373-12 12-12h424c6.627 0 12 5.373 12 12v32c0 6.627-5.373 12-12 12z"></path></svg></div> <a href="/" class="home-link router-link-active"><!----> <span class="site-name">Finen's Blog</span></a> <div class="links"><div class="search-box"><input aria-label="Search" autocomplete="off" spellcheck="false" value=""> <!----></div> <nav class="nav-links can-hide"><div class="nav-item"><a href="/post/" class="nav-link">
  文章
</a></div><div class="nav-item"><a href="/archives/" class="nav-link">
  归档
</a></div><div class="nav-item"><a href="/tags/" class="nav-link">
  标签云
</a></div><div class="nav-item"><a href="/friends/" class="nav-link">
  友人帐
</a></div> <a href="https://github.com/hirCodd" target="_blank" rel="noopener noreferrer" class="repo-link">
    GitHub
    <svg xmlns="http://www.w3.org/2000/svg" aria-hidden="true" x="0px" y="0px" viewBox="0 0 100 100" width="15" height="15" class="icon outbound"><path fill="currentColor" d="M18.8,85.1h56l0,0c2.2,0,4-1.8,4-4v-32h-8v28h-48v-48h28v-8h-32l0,0c-2.2,0-4,1.8-4,4v56C14.8,83.3,16.6,85.1,18.8,85.1z"></path> <polygon fill="currentColor" points="45.7,48.7 51.3,54.3 77.2,28.5 77.2,37.2 85.2,37.2 85.2,14.9 62.8,14.9 62.8,22.9 71.5,22.9"></polygon></svg></a></nav></div></header> <div class="sidebar-mask"></div> <aside class="sidebar"><nav class="nav-links"><div class="nav-item"><a href="/post/" class="nav-link">
  文章
</a></div><div class="nav-item"><a href="/archives/" class="nav-link">
  归档
</a></div><div class="nav-item"><a href="/tags/" class="nav-link">
  标签云
</a></div><div class="nav-item"><a href="/friends/" class="nav-link">
  友人帐
</a></div> <a href="https://github.com/hirCodd" target="_blank" rel="noopener noreferrer" class="repo-link">
    GitHub
    <svg xmlns="http://www.w3.org/2000/svg" aria-hidden="true" x="0px" y="0px" viewBox="0 0 100 100" width="15" height="15" class="icon outbound"><path fill="currentColor" d="M18.8,85.1h56l0,0c2.2,0,4-1.8,4-4v-32h-8v28h-48v-48h28v-8h-32l0,0c-2.2,0-4,1.8-4,4v56C14.8,83.3,16.6,85.1,18.8,85.1z"></path> <polygon fill="currentColor" points="45.7,48.7 51.3,54.3 77.2,28.5 77.2,37.2 85.2,37.2 85.2,14.9 62.8,14.9 62.8,22.9 71.5,22.9"></polygon></svg></a></nav>  <!----> </aside> <main class="page"> <div class="title-content"><div class="el-card box-cards is-never-shadow"><!----><div class="el-card__body"><h2>python爬取今日头条图集</h2> <div class="page-info"><i aria-hidden="true" class="el-icon-s-custom"></i> <span class="i-text">Finen</span> <i aria-hidden="true" class="el-icon-collection-tag"><span class="i-text">python</span></i><i aria-hidden="true" class="el-icon-collection-tag"><span class="i-text">python爬虫</span></i> <i aria-hidden="true" class="el-icon-time"></i> <span class="i-text">2018-03-30</span> <span id="/blog/python/python-crawling-toutiao-picture.html" data-flag-title="python爬取今日头条图集" class="leancloud-visitors"><i aria-hidden="true" class="el-icon-view"></i> <i class="leancloud-visitors-count">1000000</i></span></div></div></div></div> <div class="theme-default-content content__default"><h1 id="使用py爬取今日头条图集图片"><a href="#使用py爬取今日头条图集图片" class="header-anchor">#</a> 使用py爬取今日头条图集图片</h1> <h2 id="爬取图片并且下载到本地，同时，保存信息到mongodb中。"><a href="#爬取图片并且下载到本地，同时，保存信息到mongodb中。" class="header-anchor">#</a> 爬取图片并且下载到本地，同时，保存信息到mongoDB中。</h2> <p><strong>toutiao.py</strong></p> <div class="language-Python line-numbers-mode"><pre class="language-python"><code><span class="token keyword">import</span> json
<span class="token keyword">import</span> os
<span class="token keyword">from</span> hashlib <span class="token keyword">import</span> md5
<span class="token keyword">import</span> pymongo
<span class="token keyword">import</span> requests
<span class="token keyword">from</span> bs4 <span class="token keyword">import</span> BeautifulSoup
<span class="token keyword">from</span> requests<span class="token punctuation">.</span>exceptions <span class="token keyword">import</span> RequestException
<span class="token keyword">from</span> urllib<span class="token punctuation">.</span>parse <span class="token keyword">import</span> urlencode
<span class="token keyword">import</span> re
<span class="token keyword">from</span> config <span class="token keyword">import</span> <span class="token operator">*</span>
<span class="token keyword">from</span> multiprocessing <span class="token keyword">import</span> Pool


client <span class="token operator">=</span> pymongo<span class="token punctuation">.</span>MongoClient<span class="token punctuation">(</span>MONGO_URL<span class="token punctuation">,</span> connect<span class="token operator">=</span><span class="token boolean">False</span><span class="token punctuation">)</span>
db <span class="token operator">=</span> client<span class="token punctuation">[</span>MONGO_DB<span class="token punctuation">]</span>

headers <span class="token operator">=</span> <span class="token punctuation">{</span>
        <span class="token string">'User-Agent'</span><span class="token punctuation">:</span> <span class="token string">'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:56.0) Gecko/20100101 Firefox/56.0'</span><span class="token punctuation">,</span>
        <span class="token string">'Content-Type'</span><span class="token punctuation">:</span> <span class="token string">'application/x-www-form-urlencoded'</span><span class="token punctuation">,</span>
        <span class="token string">'Connection'</span><span class="token punctuation">:</span> <span class="token string">'Keep-Alive'</span><span class="token punctuation">,</span>
        <span class="token string">'Accept'</span><span class="token punctuation">:</span> <span class="token string">'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'</span>
    <span class="token punctuation">}</span>

<span class="token comment">#获取页面信息</span>
<span class="token keyword">def</span> <span class="token function">get_page_index</span><span class="token punctuation">(</span>offset<span class="token punctuation">,</span> keyword<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment">#定义headers头</span>

    data <span class="token operator">=</span> <span class="token punctuation">{</span>
        <span class="token string">'offset'</span><span class="token punctuation">:</span> offset<span class="token punctuation">,</span>
        <span class="token string">'format'</span><span class="token punctuation">:</span> <span class="token string">'json'</span><span class="token punctuation">,</span>
        <span class="token string">'keyword'</span><span class="token punctuation">:</span> keyword<span class="token punctuation">,</span>
        <span class="token string">'autoload'</span><span class="token punctuation">:</span> <span class="token string">'true'</span><span class="token punctuation">,</span>
        <span class="token string">'count'</span><span class="token punctuation">:</span> <span class="token string">'20'</span><span class="token punctuation">,</span>
        <span class="token string">'cur_tab'</span><span class="token punctuation">:</span> <span class="token number">3</span><span class="token punctuation">,</span>
        <span class="token string">'from'</span><span class="token punctuation">:</span> <span class="token string">'gallery'</span>
    <span class="token punctuation">}</span>

    url <span class="token operator">=</span> <span class="token string">'https://www.toutiao.com/search_content/?'</span> <span class="token operator">+</span> urlencode<span class="token punctuation">(</span>data<span class="token punctuation">)</span>
    <span class="token keyword">try</span><span class="token punctuation">:</span>
        response <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">,</span> headers<span class="token operator">=</span>headers<span class="token punctuation">)</span>
        <span class="token keyword">if</span> response<span class="token punctuation">.</span>status_code <span class="token operator">==</span> <span class="token number">200</span><span class="token punctuation">:</span>
            <span class="token keyword">return</span> response<span class="token punctuation">.</span>text
        <span class="token keyword">return</span> <span class="token boolean">None</span>
    <span class="token keyword">except</span> RequestException<span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'请求失败'</span><span class="token punctuation">)</span>
        <span class="token keyword">return</span> <span class="token boolean">None</span>

<span class="token comment">#索引</span>
<span class="token keyword">def</span> <span class="token function">parse_page_index</span><span class="token punctuation">(</span>html<span class="token punctuation">)</span><span class="token punctuation">:</span>
    data <span class="token operator">=</span> json<span class="token punctuation">.</span>loads<span class="token punctuation">(</span>html<span class="token punctuation">)</span>
    <span class="token keyword">if</span> data <span class="token keyword">and</span> <span class="token string">'data'</span> <span class="token keyword">in</span> data<span class="token punctuation">.</span>keys<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">for</span> item <span class="token keyword">in</span> data<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'data'</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
            <span class="token keyword">yield</span> item<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'article_url'</span><span class="token punctuation">)</span>


<span class="token comment">#获取详情页信息</span>
<span class="token keyword">def</span> <span class="token function">get_page_detail</span><span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">try</span><span class="token punctuation">:</span>
        response <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">,</span> headers<span class="token operator">=</span>headers<span class="token punctuation">)</span>
        <span class="token keyword">if</span> response<span class="token punctuation">.</span>status_code <span class="token operator">==</span> <span class="token number">200</span><span class="token punctuation">:</span>
            <span class="token keyword">return</span> response<span class="token punctuation">.</span>text
        <span class="token keyword">return</span> <span class="token boolean">None</span>
    <span class="token keyword">except</span> RequestException<span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'请求详情页错误'</span><span class="token punctuation">,</span> url<span class="token punctuation">)</span>
        <span class="token keyword">return</span> <span class="token boolean">None</span>

<span class="token comment"># 获取页面详情</span>
<span class="token keyword">def</span> <span class="token function">parse_page_detail</span><span class="token punctuation">(</span>html<span class="token punctuation">,</span> url<span class="token punctuation">)</span><span class="token punctuation">:</span>
    soup <span class="token operator">=</span> BeautifulSoup<span class="token punctuation">(</span>html<span class="token punctuation">,</span> <span class="token string">'lxml'</span><span class="token punctuation">)</span>
    <span class="token comment"># 标题正则对象</span>
    title_pattern <span class="token operator">=</span> re<span class="token punctuation">.</span><span class="token builtin">compile</span><span class="token punctuation">(</span><span class="token string">'title:(.*?),'</span><span class="token punctuation">,</span> re<span class="token punctuation">.</span>S<span class="token punctuation">)</span>
    <span class="token comment"># 查找</span>
    result_title <span class="token operator">=</span> re<span class="token punctuation">.</span>search<span class="token punctuation">(</span>title_pattern<span class="token punctuation">,</span> html<span class="token punctuation">)</span>
    <span class="token keyword">if</span> result_title<span class="token punctuation">:</span>
        title_data <span class="token operator">=</span> result_title<span class="token punctuation">.</span>group<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span>
        <span class="token comment"># print(title_data)</span>

    <span class="token comment"># 图片正则表达式对象</span>
    image_pattern <span class="token operator">=</span> re<span class="token punctuation">.</span><span class="token builtin">compile</span><span class="token punctuation">(</span><span class="token string">'gallery: JSON.parse\(&quot;(.*?)&quot;\)'</span><span class="token punctuation">,</span> re<span class="token punctuation">.</span>S<span class="token punctuation">)</span>
    result_image <span class="token operator">=</span> re<span class="token punctuation">.</span>search<span class="token punctuation">(</span>image_pattern<span class="token punctuation">,</span> html<span class="token punctuation">)</span>
    <span class="token comment"># 替换不需要的数据</span>
    jsonImage <span class="token operator">=</span> re<span class="token punctuation">.</span>sub<span class="token punctuation">(</span><span class="token string">r'\\{1,2}'</span><span class="token punctuation">,</span><span class="token string">''</span><span class="token punctuation">,</span>result_image<span class="token punctuation">.</span>group<span class="token punctuation">(</span><span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">)</span>
    <span class="token keyword">if</span> result_image<span class="token punctuation">:</span>
        image_data <span class="token operator">=</span> json<span class="token punctuation">.</span>loads<span class="token punctuation">(</span>jsonImage<span class="token punctuation">)</span>
        <span class="token comment"># print(result_image.group(1)) # 测试</span>

        <span class="token keyword">if</span> image_data <span class="token keyword">and</span> <span class="token string">'sub_images'</span> <span class="token keyword">in</span> image_data<span class="token punctuation">.</span>keys<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">:</span>
            sub_images <span class="token operator">=</span> image_data<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'sub_images'</span><span class="token punctuation">)</span>
            <span class="token comment"># print(sub_images) # 测试</span>
            <span class="token comment"># 装换成数组</span>
            images <span class="token operator">=</span> <span class="token punctuation">[</span>item<span class="token punctuation">.</span>get<span class="token punctuation">(</span><span class="token string">'url'</span><span class="token punctuation">)</span> <span class="token keyword">for</span> item <span class="token keyword">in</span> sub_images<span class="token punctuation">]</span>
            <span class="token comment"># 下载图片</span>
            <span class="token keyword">for</span> image <span class="token keyword">in</span> images<span class="token punctuation">:</span> download_image<span class="token punctuation">(</span>image<span class="token punctuation">)</span>
            <span class="token keyword">return</span> <span class="token punctuation">{</span>
                <span class="token string">'title'</span><span class="token punctuation">:</span> title_data<span class="token punctuation">,</span>
                <span class="token string">'url'</span><span class="token punctuation">:</span> url<span class="token punctuation">,</span>
                <span class="token string">'images'</span><span class="token punctuation">:</span> images
            <span class="token punctuation">}</span>

<span class="token comment"># 下载图片</span>
<span class="token keyword">def</span> <span class="token function">download_image</span><span class="token punctuation">(</span>url<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'正在下载：'</span><span class="token punctuation">,</span> url<span class="token punctuation">)</span>
    <span class="token keyword">try</span><span class="token punctuation">:</span>
        response <span class="token operator">=</span> requests<span class="token punctuation">.</span>get<span class="token punctuation">(</span>url<span class="token punctuation">,</span> headers<span class="token operator">=</span>headers<span class="token punctuation">)</span>
        <span class="token keyword">if</span> response<span class="token punctuation">.</span>status_code <span class="token operator">==</span> <span class="token number">200</span><span class="token punctuation">:</span>
            save_image<span class="token punctuation">(</span>response<span class="token punctuation">.</span>content<span class="token punctuation">)</span>
        <span class="token keyword">return</span> <span class="token boolean">None</span>
    <span class="token keyword">except</span> RequestException<span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'请求图片错误'</span><span class="token punctuation">,</span> url<span class="token punctuation">)</span>
        <span class="token keyword">return</span> <span class="token boolean">None</span>

<span class="token comment"># 存储图片</span>
<span class="token keyword">def</span> <span class="token function">save_image</span><span class="token punctuation">(</span>content<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token comment"># 设置路径</span>
    file_path <span class="token operator">=</span> <span class="token string">'{0}/{1}.{2}'</span><span class="token punctuation">.</span><span class="token builtin">format</span><span class="token punctuation">(</span>os<span class="token punctuation">.</span>getcwd<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> md5<span class="token punctuation">(</span>content<span class="token punctuation">)</span><span class="token punctuation">.</span>hexdigest<span class="token punctuation">(</span><span class="token punctuation">)</span><span class="token punctuation">,</span> <span class="token string">'jpg'</span><span class="token punctuation">)</span>
    <span class="token keyword">if</span> <span class="token keyword">not</span> os<span class="token punctuation">.</span>path<span class="token punctuation">.</span>exists<span class="token punctuation">(</span>file_path<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">with</span> <span class="token builtin">open</span><span class="token punctuation">(</span>file_path<span class="token punctuation">,</span> <span class="token string">'wb'</span><span class="token punctuation">)</span> <span class="token keyword">as</span> f<span class="token punctuation">:</span>
            f<span class="token punctuation">.</span>write<span class="token punctuation">(</span>content<span class="token punctuation">)</span>
            f<span class="token punctuation">.</span>close<span class="token punctuation">(</span><span class="token punctuation">)</span>

<span class="token comment"># 存储到mongoDB</span>
<span class="token keyword">def</span> <span class="token function">save_to_mongo</span><span class="token punctuation">(</span>result<span class="token punctuation">)</span><span class="token punctuation">:</span>
    <span class="token keyword">if</span> db<span class="token punctuation">[</span>MONGO_TABLE<span class="token punctuation">]</span><span class="token punctuation">.</span>insert<span class="token punctuation">(</span>result<span class="token punctuation">)</span><span class="token punctuation">:</span>
        <span class="token keyword">print</span><span class="token punctuation">(</span><span class="token string">'存储成功'</span><span class="token punctuation">,</span> result<span class="token punctuation">)</span>
        <span class="token keyword">return</span> <span class="token boolean">True</span>
    <span class="token keyword">return</span> <span class="token boolean">False</span>

<span class="token keyword">def</span> <span class="token function">main</span><span class="token punctuation">(</span>offset<span class="token punctuation">)</span><span class="token punctuation">:</span>
    html <span class="token operator">=</span> get_page_index<span class="token punctuation">(</span>offset<span class="token punctuation">,</span> KEYWORD<span class="token punctuation">)</span>
    <span class="token keyword">for</span> url <span class="token keyword">in</span> parse_page_index<span class="token punctuation">(</span>html<span class="token punctuation">)</span><span class="token punctuation">:</span>
        html <span class="token operator">=</span> get_page_detail<span class="token punctuation">(</span>url<span class="token punctuation">)</span>
        <span class="token keyword">if</span> html<span class="token punctuation">:</span>
            result <span class="token operator">=</span> parse_page_detail<span class="token punctuation">(</span>html<span class="token punctuation">,</span> url<span class="token punctuation">)</span>
            <span class="token keyword">if</span> result<span class="token punctuation">:</span> save_to_mongo<span class="token punctuation">(</span>result<span class="token punctuation">)</span>

<span class="token keyword">if</span> __name__ <span class="token operator">==</span> <span class="token string">'__main__'</span><span class="token punctuation">:</span>
    groups <span class="token operator">=</span> <span class="token punctuation">[</span>x<span class="token operator">*</span><span class="token number">20</span> <span class="token keyword">for</span> x <span class="token keyword">in</span> <span class="token builtin">range</span><span class="token punctuation">(</span>GROUP_START<span class="token punctuation">,</span> GROUP_END<span class="token operator">+</span><span class="token number">1</span><span class="token punctuation">)</span><span class="token punctuation">]</span>
    <span class="token comment"># 开启多线程下载</span>
    pool <span class="token operator">=</span> Pool<span class="token punctuation">(</span><span class="token punctuation">)</span>
    pool<span class="token punctuation">.</span><span class="token builtin">map</span><span class="token punctuation">(</span>main<span class="token punctuation">,</span> groups<span class="token punctuation">)</span>
    <span class="token comment"># main()</span>

</code></pre> <div class="line-numbers-wrapper"><span class="line-number">1</span><br><span class="line-number">2</span><br><span class="line-number">3</span><br><span class="line-number">4</span><br><span class="line-number">5</span><br><span class="line-number">6</span><br><span class="line-number">7</span><br><span class="line-number">8</span><br><span class="line-number">9</span><br><span class="line-number">10</span><br><span class="line-number">11</span><br><span class="line-number">12</span><br><span class="line-number">13</span><br><span class="line-number">14</span><br><span class="line-number">15</span><br><span class="line-number">16</span><br><span class="line-number">17</span><br><span class="line-number">18</span><br><span class="line-number">19</span><br><span class="line-number">20</span><br><span class="line-number">21</span><br><span class="line-number">22</span><br><span class="line-number">23</span><br><span class="line-number">24</span><br><span class="line-number">25</span><br><span class="line-number">26</span><br><span class="line-number">27</span><br><span class="line-number">28</span><br><span class="line-number">29</span><br><span class="line-number">30</span><br><span class="line-number">31</span><br><span class="line-number">32</span><br><span class="line-number">33</span><br><span class="line-number">34</span><br><span class="line-number">35</span><br><span class="line-number">36</span><br><span class="line-number">37</span><br><span class="line-number">38</span><br><span class="line-number">39</span><br><span class="line-number">40</span><br><span class="line-number">41</span><br><span class="line-number">42</span><br><span class="line-number">43</span><br><span class="line-number">44</span><br><span class="line-number">45</span><br><span class="line-number">46</span><br><span class="line-number">47</span><br><span class="line-number">48</span><br><span class="line-number">49</span><br><span class="line-number">50</span><br><span class="line-number">51</span><br><span class="line-number">52</span><br><span class="line-number">53</span><br><span class="line-number">54</span><br><span class="line-number">55</span><br><span class="line-number">56</span><br><span class="line-number">57</span><br><span class="line-number">58</span><br><span class="line-number">59</span><br><span class="line-number">60</span><br><span class="line-number">61</span><br><span class="line-number">62</span><br><span class="line-number">63</span><br><span class="line-number">64</span><br><span class="line-number">65</span><br><span class="line-number">66</span><br><span class="line-number">67</span><br><span class="line-number">68</span><br><span class="line-number">69</span><br><span class="line-number">70</span><br><span class="line-number">71</span><br><span class="line-number">72</span><br><span class="line-number">73</span><br><span class="line-number">74</span><br><span class="line-number">75</span><br><span class="line-number">76</span><br><span class="line-number">77</span><br><span class="line-number">78</span><br><span class="line-number">79</span><br><span class="line-number">80</span><br><span class="line-number">81</span><br><span class="line-number">82</span><br><span class="line-number">83</span><br><span class="line-number">84</span><br><span class="line-number">85</span><br><span class="line-number">86</span><br><span class="line-number">87</span><br><span class="line-number">88</span><br><span class="line-number">89</span><br><span class="line-number">90</span><br><span class="line-number">91</span><br><span class="line-number">92</span><br><span class="line-number">93</span><br><span class="line-number">94</span><br><span class="line-number">95</span><br><span class="line-number">96</span><br><span class="line-number">97</span><br><span class="line-number">98</span><br><span class="line-number">99</span><br><span class="line-number">100</span><br><span class="line-number">101</span><br><span class="line-number">102</span><br><span class="line-number">103</span><br><span class="line-number">104</span><br><span class="line-number">105</span><br><span class="line-number">106</span><br><span class="line-number">107</span><br><span class="line-number">108</span><br><span class="line-number">109</span><br><span class="line-number">110</span><br><span class="line-number">111</span><br><span class="line-number">112</span><br><span class="line-number">113</span><br><span class="line-number">114</span><br><span class="line-number">115</span><br><span class="line-number">116</span><br><span class="line-number">117</span><br><span class="line-number">118</span><br><span class="line-number">119</span><br><span class="line-number">120</span><br><span class="line-number">121</span><br><span class="line-number">122</span><br><span class="line-number">123</span><br><span class="line-number">124</span><br><span class="line-number">125</span><br><span class="line-number">126</span><br><span class="line-number">127</span><br><span class="line-number">128</span><br><span class="line-number">129</span><br><span class="line-number">130</span><br><span class="line-number">131</span><br><span class="line-number">132</span><br><span class="line-number">133</span><br><span class="line-number">134</span><br><span class="line-number">135</span><br><span class="line-number">136</span><br><span class="line-number">137</span><br><span class="line-number">138</span><br><span class="line-number">139</span><br><span class="line-number">140</span><br><span class="line-number">141</span><br><span class="line-number">142</span><br></div></div><h1 id="全局配置文件"><a href="#全局配置文件" class="header-anchor">#</a> 全局配置文件</h1> <p>在这里可以自定义全局配置信息：</p> <h2 id="在keyword可以自定义关键字信息，mongodb的数据密码可能需要填写"><a href="#在keyword可以自定义关键字信息，mongodb的数据密码可能需要填写" class="header-anchor">#</a> 在KEYWORD可以自定义关键字信息，mongoDB的数据密码可能需要填写</h2> <p><strong>config.py</strong></p> <div class="language- line-numbers-mode"><pre class="language-text"><code>MONGO_URL = 'localhost'
MONGO_DB = 'toutiao'
MONGO_TABLE = 'toutiao'
# 如果没有mongodb密码就不用写数据库密码了，如果有就需要填写一下
GROUP_START = 1
GROUP_END = 20
KEYWORD = &quot;街拍&quot;
</code></pre> <div class="line-numbers-wrapper"><span class="line-number">1</span><br><span class="line-number">2</span><br><span class="line-number">3</span><br><span class="line-number">4</span><br><span class="line-number">5</span><br><span class="line-number">6</span><br><span class="line-number">7</span><br></div></div><h1 id="结果如下图："><a href="#结果如下图：" class="header-anchor">#</a> 结果如下图：</h1> <p><img src="https://img-blog.csdn.net/20180330174630353?watermark/2/text/aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L0hvb2tKb255/font/5a6L5L2T/fontsize/400/fill/I0JBQkFCMA==/dissolve/70" alt="Picture"></p></div> <footer class="page-edit"><!----> <!----></footer> <!---->  <div class="vcomment"><div id="vcomments"></div></div></main></div><div class="global-ui"><!----><!----></div></div>
    <script src="/assets/js/app.8657dc36.js" defer></script><script src="/assets/js/3.655fda8f.js" defer></script><script src="/assets/js/74.d36a7a16.js" defer></script><script src="/assets/js/11.32858d9c.js" defer></script>
  </body>
</html>
